---
output:
  pdf_document:
    citation_package: natbib
    toc: true
    keep_tex: false
    fig_caption: true
    latex_engine: pdflatex
    template: header.tex
title: "Supplementary Materials -- No Longer Conforming to Stereotypes? Gender, Political Style, and Parliamentary Debate in the UK" 
#thanks: "**This version**: `r format(Sys.time(), '%B %d, %Y')`."  
link-citations: yes
always_allow_html: true
geometry: margin=1in
fontfamily: mathpazo
fontsize: 12pt
bibliography: test
biblio-style: apsr

--- 


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.pos= "h")
library(data.table)
library(knitr)
library(kableExtra)
library(xtable)
library(ggplot2)
library(corrplot)

# Load data

load("working/dictionaries.Rdata")
load("working/speech_scores.Rdata")
load("data/validation_data.Rdata")

speech_scores[,keep:=.N > 4, by = section_id]
speech_scores <- speech_scores[keep == TRUE]
speech_scores <- speech_scores[speech_scores$parliamentary_term != "1992-1997"]

x <- word_scores$affect

word_scores$affect <- word_scores$affect[word_scores$affect$word != "parliamentary_jargon",]
word_scores$posemo <- word_scores$posemo[word_scores$posemo$word != "parliamentary_jargon",]
word_scores$negemo <- word_scores$negemo[word_scores$negemo$word != "parliamentary_jargon",]
word_scores$fact <- word_scores$fact[word_scores$fact$word != "parliamentary_jargon",]
word_scores$anecdote <- word_scores$anecdote[word_scores$anecdote$word != "parliamentary_jargon",]
word_scores$aggression <- word_scores$aggression[word_scores$aggression$word != "parliamentary_jargon",]

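# Build the word-level validation table for one style: the top-scoring words,
# the high-scoring words added by the embeddings, and the seed words removed.
# `style` is unused but kept so that existing calls continue to work.
added_and_removed_words <- function(x, style = "Something", n = 30) {

  # Top n words (assumes x is already sorted from highest to lowest score)
  top <- x$word[1:n]

  # High-scoring words that are not in the seed dictionary
  added <- x[x$sigmoid > .8 & !x$in_original_dictionary][1:n]

  # Seed-dictionary words with the lowest scores
  removed <- x[x$in_original_dictionary]
  removed <- removed[order(removed$score), ][1:n]

  out <- data.frame(Top = top, Added = added$word, Removed = removed$word)

  # Escape underscores so that the words print correctly in LaTeX
  out$Top <- gsub("_", "\\_", out$Top, fixed = TRUE)
  out$Added <- gsub("_", "\\_", out$Added, fixed = TRUE)
  out$Removed <- gsub("_", "\\_", out$Removed, fixed = TRUE)

  return(out)
}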


```


\setcounter{page}{1}
\setcounter{figure}{0}
\setcounter{table}{0}
\setcounter{equation}{0}
\renewcommand{\thepage}{S\arabic{page}}
\renewcommand{\thesection}{S\arabic{section}}
\renewcommand{\thetable}{S\arabic{table}}
\renewcommand{\thefigure}{S\arabic{figure}}
\renewcommand{\theequation}{S\arabic{equation}}

\doublespacing
\thispagestyle{empty}
\newpage


# Word-embedding-based dictionaries 

Our word-embedding-based measurement strategy consists of several steps, which we describe in more detail in this section. 

First, for each style we define a "seed" dictionary that represents our concept of interest. We use the following sources to construct our seed dictionaries: 

1. **Affect** -- Linguistic Inquiry and Word Count 2015 (Affect) [@Pennebaker2015]
1. **Fact** -- Linguistic Inquiry and Word Count 2015 (Number and Quantitative) [@Pennebaker2015] and all occurrences of any numeric figures
1. **Positive Emotion** -- Regressive Imagery Dictionary (Emotions: Positive Affect) [@Martindale1990]
1. **Negative Emotion** -- Regressive Imagery Dictionary (Emotions: Anxiety and Sadness) [@Martindale1990]
1. **Aggression** -- A bespoke dictionary of words (see figure S1 below)
1. **Human Narrative** -- A bespoke dictionary of words (see figure S2 below) and the 200 most common names of children born between 1970 and 2019


The final two seed dictionaries -- which relate to aggression and human narrative -- are our original constructions. These dictionaries were constructed by reading and watching debates from the House of Commons that are known to feature either aggression (for instance, Prime Minister's Questions) or examples of human narrative (for instance, debates on mental health or social policy issues), and selecting words and phrases that we thought were likely to capture these concepts in a broader set of parliamentary debates. We report the full lists of words that feature in these new seed dictionaries in figures \ref{aggression_dictionary} and \ref{anecdote_dictionary}.

\begin{figure}[htpb!]
\caption{``Aggression'' seed dictionary \label{aggression_dictionary}}
\input{analysis/aggression_words.txt}
\end{figure}

\begin{figure}[htpb!]
\caption{``Human narrative'' seed dictionary \label{anecdote_dictionary}}
\input{analysis/anecdote_words.txt}
\end{figure} 

Second, a key component of our approach to measuring style is a set of word-embeddings, which we estimate from the full corpus of parliamentary speeches. Word-embedding models, which are of increasing use in political science [@Spirling2019], seek to describe any word in a corpus as a dense, real-valued vector of numbers. The construction of the word-embedding vectors, regardless of the specific algorithm used to estimate them, relies centrally on the distributional hypothesis: the idea that words which are used in similar contexts will have similar meanings. Here, a context refers to a window of words around a target word, and the embedding model allows us to *learn* the semantic meaning of each word directly from the use of the word in the corpus. 

The main output of an embedding model is the set of word-embeddings themselves. These are vectors that correspond to each unique word in the corpus. The dimensions of the embedding vectors capture different semantic "meanings" that can be used to provide structure to vocabulary. Crucially for our purposes, given this representation, the distances *between* word-vectors have been shown to effectively capture important semantic similarities between different words [@Mikolov2013]. We use this property to define the set of words that, *in the context of UK parliamentary debate*, are used in a semantically similar fashion to the seed words. 

We follow the estimation procedure outlined in @Pennington2014 and estimate a word embedding, $W$, of length $J = 150$ for each unique word in our corpus. We use a small "context" window size of 3 words either side of the target word to estimate our embeddings. This is consistent with our aim of capturing semantic (rather than topical) relations between words  [@Spirling2019,7]. We exclude all words that occur very rarely (fewer than 90 times overall), and all words that occur very frequently (in more than 90% of documents). We remove all stop-words, punctuation, and a bespoke list of parliamentary address terms such as "Honourable Friend" or "Home Secretary". We collect the embeddings in a matrix, $\theta$, which we use to calculate the mean word-embedding vector for each of our seed dictionaries. The average word-embedding of the seed words represents the "location" of the dictionary in the vector-space defined by the embedding model, and allows us to calculate the relative semantic similarity of different words to the dictionary.

Third, we calculate the similarity between *every* word in the corpus and the mean dictionary word-vector using the cosine-similarity metric. Words closely related to the average semantic meaning of the seed words will have a high similarity score, and words that are less closely related will have a low similarity score. We then follow @Zamani2016 and apply the sigmoid function to the similarity scores, which transforms all similarity scores to the [0,1] interval and shrinks the scores of all but the most similar words to very close to zero. Where $x_w^s$ is the cosine similarity between the word-embedding for word $w$ and the mean word-embedding of the seed dictionary for style $s$, the sigmoid transformation is given by:

\begin{equation}
Sim_{w}^s = \frac{1}{1 + e^{-a(x_{w}^s - c)}}
\end{equation}

Here, $a$ and $c$ are free parameters which we set to be equal to 40 and .35, respectively, based on the results in @Zamani2016[3]. $Sim_{w}^s$ gives our final score for each word for each style. Words closely related to the average semantic meaning of the seed words for a given dictionary will have a high $Sim_{w}^s$, and words that are less closely related will have a low $Sim_{w}^s$.
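
As an illustration of this third step, the following minimal sketch computes the cosine similarity between each word and a mean seed vector and then applies the sigmoid transformation with $a = 40$ and $c = .35$. The toy embedding matrix `theta_toy`, the two-word seed list, and the helper functions `cosine_to_vector` and `sigmoid_sim` are hypothetical and are used here only for exposition; they are not taken from our replication code.

```r
# Illustrative sketch only: toy 3-dimensional embeddings (hypothetical values).
theta_toy <- rbind(
  disgraceful = c(0.90, 0.10, 0.00),
  shameful    = c(0.80, 0.20, 0.10),
  outrageous  = c(0.85, 0.15, 0.05),
  committee   = c(0.00, 0.10, 0.90)
)
seed_words <- c("disgraceful", "shameful")  # hypothetical seed dictionary

# Mean embedding of the seed words: the "location" of the dictionary.
seed_mean <- colMeans(theta_toy[seed_words, , drop = FALSE])

# Cosine similarity between each row of a matrix and a single vector.
cosine_to_vector <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Sigmoid transformation with the parameter values stated in the text.
sigmoid_sim <- function(x, a = 40, c = 0.35) {
  1 / (1 + exp(-a * (x - c)))
}

cos_sim <- cosine_to_vector(theta_toy, seed_mean)
names(cos_sim) <- rownames(theta_toy)
sim_scores <- sigmoid_sim(cos_sim)
```

In this toy example, "outrageous" lies close to the seed mean and so receives a score near one, while "committee" points in a different direction and is shrunk to a score near zero.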

Finally, we use the word-level scores, $Sim_{w}^s$, to score each *sentence* in the corpus. As described in the main body of the paper, the score for a given sentence on a given dimension is: 

\begin{equation} \label{sim_eq}
Score_{i}^s = \frac{\sum_w^W Sim_{w}^s N_{wi}}{\sum_w^W N_{wi}}
\end{equation}

\noindent where $Sim_{w}^s$ is the similarity score defined above, and $N_{wi}$ is the (weighted) number of times that word $w$ appears in sentence $i$, where the weights are term-frequency inverse-document-frequency weights.[^tf_idf] $Score_{i}^s$ represents the fraction of words in sentence $i$ that are relevant to dictionary $s$. When words with high scores for a given style appear frequently in a given sentence, the sentence will be scored as highly relevant to the style. The score for each *document* is then the weighted average of the relevant sentence level scores, where the weights are equal to the number of words in each sentence. 

[^tf_idf]: TF-IDF weighting is used to down-weight very common words, and up-weight relatively rare words.
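
The sentence-level and document-level scores can be sketched as follows. The word scores in `sim_s`, the weighted counts in `n_wi`, and the sentence lengths are made-up values chosen only to show the arithmetic of equation \ref{sim_eq}; the object names are hypothetical.

```r
# Illustrative sketch only: hypothetical word scores and TF-IDF-weighted counts.
sim_s <- c(disgraceful = 0.95, shameful = 0.90, committee = 0.01)

# Rows are sentences, columns are words; entries play the role of N_wi.
n_wi <- rbind(
  sentence_1 = c(disgraceful = 1.2, shameful = 0.8, committee = 0.0),
  sentence_2 = c(disgraceful = 0.0, shameful = 0.0, committee = 2.0)
)

# Sentence score: weighted sum of word scores divided by the total weight.
sentence_score <- as.vector(n_wi %*% sim_s[colnames(n_wi)]) / rowSums(n_wi)

# Document score: average of sentence scores, weighted by words per sentence.
words_per_sentence <- c(sentence_1 = 2, sentence_2 = 1)  # hypothetical lengths
document_score <- weighted.mean(sentence_score, w = words_per_sentence)
```

Here the first sentence contains only high-scoring words and the second only a low-scoring word, so the document score is pulled towards the longer first sentence.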


\newpage

# Validation tests

As with all quantitative text analysis approaches, careful validation of our measures is essential [@Grimmer2013], and we provide two face validation checks in this section, as well as results from a human validation task. 

## Face validity checks

In table \ref{tab:word_concept_table}, we examine the words that are associated with large $Sim_{w}^s$ values for each of our styles. In particular, the table shows the top 30 words associated with each concept according to our word-embedding measure (*Top*), the words that are high-scoring based on the word-embedding measure, but which do not feature in the seed dictionaries (*Added*), and the words that are low-scoring on the word-embedding measure but which did feature in the seed dictionaries (*Removed*). The *Added* words are particularly important, as they represent words that are used in a similar context to the words in our seed dictionary in the parliamentary setting, but which would be missed by traditional dictionary-based approaches.

The table reveals that high-weight words (*Top*) generally correspond very closely to the style dimensions to which they relate. For instance, the top-loading words in the "Positive Emotion" dimension include "joy", "delight", "eager", and "excitement". Similarly, in the "Aggression" dimension, top words include "disgraceful", "shameful", "outrageous", and "scaremongering". It is also encouraging that the top words in the "Fact" dimension are mostly numeric quantifiers, and the top "Human Narrative" words include "constituent", "told", "wrote", "said", and several words that indicate specific individuals ("son", "father", "wife").

In addition, many words that are not included in the original seed dictionaries are nevertheless given high weights via the word-embedding approach (*Added*). For example, the words "shocking", "incompetence", "pathetic", and "deplore" do not appear in the "Aggression" seed dictionary, but nevertheless receive high weights for that style. That these words are consistent with intuitive notions of these broad stylistic categories, although not in the original dictionaries, highlights the fact that the word-embedding approach is successfully finding words that are semantically closely related to our key concepts of interest. 

Similarly, the table also shows that some words included in the original seed dictionaries which are not semantically similar to the relevant concepts in the context of parliamentary debate are given low weights by the word-embedding approach (*Removed*). For example, that "terrorism" is removed from the "Negative Emotion" dictionary is encouraging, as within a parliamentary context the word "terrorism" is likely to refer to matters of policy rather than to express emotion. 

Overall, the words in table \ref{tab:word_concept_table} suggest that our word-embedding model is a) accurately associating sensible words with our stylistic concepts; and b) capturing language use that is representative of a given style, even when those words are not included in our seed dictionaries, and so would be missed by traditional dictionary approaches.

 

\blandscape

```{r, results='asis'}

affect_ar <- added_and_removed_words(word_scores$affect, style = "Affect")
posemo_ar <- added_and_removed_words(word_scores$posemo, style = "Posemo")
negemo_ar <- added_and_removed_words(word_scores$negemo, style = "Negemo")
fact_ar <- added_and_removed_words(word_scores$fact, style = "Fact")
anecdote_ar <- added_and_removed_words(word_scores$anecdote, style = "Anecdote")
aggression_ar <- added_and_removed_words(word_scores$aggression, style = "Aggression")

out1 <- cbind(affect_ar, posemo_ar)
out2 <- cbind(negemo_ar, aggression_ar)
out3 <- cbind(fact_ar, anecdote_ar)

names(out1) <- paste0("\\emph{",names(out1),"}")
names(out2) <- paste0("\\emph{",names(out2),"}")
names(out3) <- paste0("\\emph{",names(out3),"}")

library(xtable)
add_to_row <- list()
add_to_row$pos <- list(-1) 

add_to_row$command <- paste0('\\multicolumn{3}{c}{\\textbf{Affect}} 
                             & \\multicolumn{3}{c}{\\textbf{Positive Emotion}}\\\\')

print.xtable(xtable(out1, align = "c|ccc|ccc|"), include.rownames = F, comment = F, add.to.row = add_to_row, sanitize.text.function = identity, table.placement = "t")

add_to_row$command <- paste0('\\multicolumn{3}{c}{\\textbf{Negative Emotion}}
                             & \\multicolumn{3}{c}{\\textbf{Aggression}}\\\\')

print.xtable(xtable(out2, align = "c|ccc|ccc|"), include.rownames = F, comment = F, add.to.row = add_to_row, sanitize.text.function = identity, table.placement = "b")

add_to_row$command <- paste0('\\multicolumn{3}{c}{\\textbf{Fact}}
                             & \\multicolumn{3}{c}{\\textbf{Human Narrative}}\\\\')

print.xtable(xtable(out3, align = "c|ccc|ccc|", label = "tab:word_concept_table", caption = "Word-level validation"), include.rownames = F, comment = F, add.to.row = add_to_row, sanitize.text.function = identity, table.placement = "b")

```

\elandscape

\newpage

Tables \ref{tab:sentence_examples_1} and \ref{tab:sentence_examples_2} assess the face validity of our approach by showing the 10 highest scoring *sentences* for each style, according to the $Score_{i}^s$ measure described in equation \ref{sim_eq}. For all styles, the sentences clearly reflect the conceptual definitions we outline in the main paper. For instance, the "fact" category is dominated by statements using numerical language, and the "human narrative" category has many examples of MPs referring to the experiences of specific individuals. This again suggests that our measurement strategy plausibly captures our stylistic dimensions of interest.


\singlespacing

```{r, results='asis'}

load("working/top_sentences.Rdata")

n_sentences <- 10

out1 <- data.frame(`Affect` = top_affect$sent[1:n_sentences],
                   `Positive Emotion` = top_posemo$sent[1:n_sentences],
                   `Human Narrative` = top_anecdote$sent[1:n_sentences])

names(out1) <- c("Affect", "Positive Emotion", "Human Narrative")

out2 <- data.frame(`Aggression` = top_aggression$sent[1:n_sentences], 
                   `Fact` = top_fact$sent[1:n_sentences],
                   `Negative Emotion` = top_negemo$sent[1:n_sentences])

names(out2) <- c("Aggression", "Fact", "Negative Emotion")

```


```{r, results='asis'}


kable(out1, caption = "Top sentences for Affect, Positive Emotion, and Human Narrative \\label{tab:sentence_examples_1}", 
      format = "latex", longtable = T) %>%
  kable_styling(full_width = F, font_size = 9) %>%
  column_spec(1:3, width = "7cm") %>%
  landscape()



```

```{r, results='asis'}

kable(out2, caption = "Top sentences for Aggression, Fact, and Negative Emotion \\label{tab:sentence_examples_2}", 
      format = "latex", longtable = T) %>%
  kable_styling(full_width = F, font_size = 9) %>%
  column_spec(1:3, width = "7cm") %>%
  landscape()

```

\doublespacing

\newpage

## Human validation task

In this section, we provide results from a human validation task which assesses whether our text-based measures of style mirror human judgements of the same concepts. We wrote a web app which presented two research assistants with pairs of sentences (sampled from all sentences in our corpus). Coders were asked to complete two tasks. First, a style-*comparison* task required them to select which of the two sentences was more typical of a particular style. Second, a style-*intensity* task required them to rate the degree to which each sentence was representative of the selected style on a 5-point scale. 


![Human validation task prompt\label{fig:prompt_screen}](app.jpg)

Figure \ref{fig:prompt_screen} gives an example of the prompt seen by our coders. In addition to the sentences themselves, we presented coders with minimal definitions of the speech-styles of interest to ensure that the human coding related to the style dimensions identified in the literature review.  

```{r}
## Intercoder reliability
a <- validation_data[intercoder == TRUE,c("slider_one","text_one_glove","style_type","coder_name")]
b <- validation_data[intercoder == TRUE,c("slider_two","text_two_glove","style_type","coder_name")]
levels_data_intercoder <- rbind(a,b, use.names = FALSE)

```

Each coder completed 70 comparisons per style, on average, meaning that we have on average 140 individual sentence-ratings per style. We use the distribution of responses to these tasks and compare them to the distribution of text-based style measures described in the main body of the paper for the same sentences as seen by the coders.[^inter_coder] 

[^inter_coder]: To assess inter-coder reliability, our research assistants both coded an additional common set of 20 comparisons per style. Coders agreed on which of the two sentences was more representative of a given style in `r round(mean(validation_data[intercoder == TRUE & coder_name == "Alicia"]$first == validation_data[intercoder == TRUE & coder_name == "Agnes"]$first) * 100)`% of comparisons. The correlation for the "intensity" scores for all sentences across coders was `r round(cor(x=levels_data_intercoder$slider_one[levels_data_intercoder$coder_name=="Agnes"], y=levels_data_intercoder$slider_one[levels_data_intercoder$coder_name=="Alicia"]),2)`.

We summarise the results in table \ref{tab:correlation_validation}. The "intensity task" column presents the correlation between our sentence-level style measures (equation \ref{sim_eq}) and our coders' ratings of the same styles. For the "comparison task" column, we calculate the difference in the sentence-level scores for each pair of sentences, and correlate that with the choices made by our coders from the comparison task. 


```{r}

cors <- validation_data[,list(round(cor(glove_difference_std, first),2), n = .N), by = style_type]

a <- validation_data[,c("slider_one","text_one_glove","style_type")]
b <- validation_data[,c("slider_two","text_two_glove", "style_type")]
levels_data <- rbind(a,b, use.names = FALSE)

cors_levels <- levels_data[,list(round(cor(slider_one, text_one_glove),2), n = .N),by = style_type]

cors_dic <- validation_data[,list(round(cor(dic_difference, first, use = "complete.obs"),2), n = .N), by = style_type]
cors_dic <- cors_dic[cors_dic$style_type != "Complexity"]
cors_dic <- cors_dic[cors_dic$style_type != "Repetition"]

a <- validation_data[,c("slider_one","text_one_dic_prop","style_type")]
b <- validation_data[,c("slider_two","text_two_dic_prop", "style_type")]
levels_data_dic <- rbind(a,b, use.names=FALSE)

cors_levels_dic <- levels_data_dic[,list(round(cor(slider_one, text_one_dic_prop, use = "complete.obs"),2), n = .N),by = style_type]
cors_levels_dic <- subset(cors_levels_dic, subset = (style_type != "Complexity") & (style_type != "Repetition"))

names(cors) <- c("style_type", "glove_cor", "glove_n")
names(cors_levels) <- c("style_type", "glove_cor_level", "glove_n")

names(cors_dic) <- c("style_type", "dic_cor", "glove_n")
names(cors_levels_dic) <- c("style_type", "dic_cor_level", "glove_n")

tmp <- merge(cors[,1:2], cors_levels[,1:2], by = "style_type", all = TRUE)
tmp <- merge(tmp, cors_dic[,1:2], by = "style_type", all = TRUE)
tmp <- merge(tmp, cors_levels_dic[,1:2], by = "style_type", all = TRUE)

tmp_for_text <- tmp

tmp$pairwise_cor <- ifelse(!is.na(tmp$dic_cor), paste0(tmp$glove_cor, " (", tmp$dic_cor,")"), tmp$glove_cor)
tmp$level_cor <- ifelse(!is.na(tmp$dic_cor_level), paste0(tmp$glove_cor_level, " (", tmp$dic_cor_level,")"), tmp$glove_cor_level)

tmp <- tmp[,c("style_type", "pairwise_cor", "level_cor")]
tmp$style_type <- gsub(" 2","", tmp$style_type)
tmp$style_type[tmp$style_type == "Emotion"] <- "Affect"
tmp <- tmp[match(c("Human Narrative", "Affect", "Positive Emotion", "Negative Emotion", "Fact", "Aggression", "Complexity", "Repetition"), tmp$style_type),]

library(kableExtra)

kable(tmp, format = "latex", col.names = c("Style type", "Comparison task", "Intensity task"), caption = "Correlation between text-based measures and human judgments\\label{tab:correlation_validation}. ", booktabs = T)

```

Overall, the results are very encouraging. Across all styles, the correlation between the text-based scores and the human validation is always positive and is never lower than `r min(c(tmp_for_text$glove_cor, tmp_for_text$glove_cor_level))` for either task. These results suggest that there is a clear correspondence between the measures of style implied by our text-analysis approach, and human judgements of those concepts in the same set of texts.[^repetition_validation]

[^repetition_validation]: As repetitiveness is a quantity that manifests more clearly *across* rather than *within* sentences, our sentence-based human validation is somewhat less well suited to evaluating this concept. Nevertheless, the sentences that our measure marks as most repetitive do clearly demonstrate high levels of repetitiveness, and, as table \ref{tab:correlation_validation} indicates, even though detecting repetitiveness at the sentence-level might represent a hard task, we recover a clear correspondence between our measures and human judgements of the same concept. 

Moreover, we can compare our measures with standard dictionary-based measurement approaches. For all styles except for repetition and complexity, we compare our word-embedding approach to an approach that measures style using the proportion of words in each sentence that appears in a pre-defined dictionary. This measurement strategy is more typical of existing applications of dictionaries in political science, and forms the basis of the analysis in several previous studies on gender and political style [e.g., @Gleason2019; @Jones2016; @Yu2013]. To maximise comparability, the dictionaries we use for this analysis are the same as the seed dictionaries we use to construct our word-embedding scores:

- *Affect* -- Linguistic Inquiry and Word Count 2015 (Affect) [@Pennebaker2015] 
- *Fact* -- Linguistic Inquiry and Word Count 2015 (Number and Quantitative) [@Pennebaker2015] and all occurrences of any numeric figures 
- *Positive Emotion* -- Regressive Imagery Dictionary (Emotions: Positive Affect) [@Martindale1990] 
- *Negative Emotion* -- Regressive Imagery Dictionary (Emotions: Anxiety and Sadness) [@Martindale1990]
- *Aggression* -- our bespoke dictionary of words shown in figure \ref{aggression_dictionary}
- *Human Narrative* -- our bespoke dictionary of words shown in figure \ref{anecdote_dictionary} and the 200 most common names of children born between 1970 and 2019. 

This means that, for each sentence in our corpus, we have a measure of style based on our word embedding method (described in equation 1 in the paper), and a measure of style based on counting the fraction of words in the sentence that fall into the relevant style's seed dictionary.

The results are given in table \ref{tab:correlation_validation}. The numbers in parentheses show the correlation between the standard dictionary measure of style described above, and human judgements provided by our coders. Our word-embedding approach clearly outperforms standard dictionary approaches in approximating human judgement. For instance, for positive emotion, standard dictionary measures correlate at `r tmp_for_text$dic_cor[tmp_for_text$style_type=="Positive Emotion"]` and `r tmp_for_text$dic_cor_level[tmp_for_text$style_type=="Positive Emotion"]` with human codings for the two tasks, compared to `r tmp_for_text$glove_cor[tmp_for_text$style_type=="Positive Emotion"]` and `r tmp_for_text$glove_cor_level[tmp_for_text$style_type=="Positive Emotion"]` for the word-embedding approach. Despite the relatively small sample sizes, the magnitude of the difference in predictive power means that -- in all cases except for "fact" -- the correlation between our word-embedding measures and human codings is significantly higher than the equivalent correlation for standard dictionary measures.[^bootstrap] Overall, this exercise provides strong evidence that we can reliably detect our styles of interest in parliamentary speech and outperform the standard measures used in previous studies on gender and political style. 


[^bootstrap]: We determine this difference by using a bootstrap procedure, in which we sample from our set of sentences 2000 times with replacement and calculate the correlation between our word-embedding measures and human codings, and between the dictionary measures and human codings, on each iteration. We can easily reject the null hypothesis of no difference in these correlations for all styles except for the "fact" dimension.
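
A minimal sketch of this kind of bootstrap comparison is given below. It uses simulated data in place of our validation sample, and the variable names (`human`, `embedding`, `dictionary`) and the use of a percentile interval to summarise the bootstrap distribution are assumptions made for illustration rather than a description of our exact implementation.

```r
# Illustrative sketch only: simulated data stand in for the validation sample.
set.seed(1)
n <- 140
human      <- rnorm(n)                    # hypothetical human intensity ratings
embedding  <- human + rnorm(n, sd = 0.8)  # hypothetical word-embedding scores
dictionary <- human + rnorm(n, sd = 3)    # hypothetical dictionary scores

n_boot <- 2000
boot_diff <- replicate(n_boot, {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample sentences
  # Difference between the two correlations with human codings
  cor(embedding[idx], human[idx]) - cor(dictionary[idx], human[idx])
})

# Percentile interval for the difference in correlations (an assumption here).
ci <- quantile(boot_diff, probs = c(0.025, 0.975))
```

Resampling the same sentences for both correlations on each iteration keeps the comparison paired, so the interval reflects the difference between the two measures rather than differences between samples.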

\newpage

# Controlling for individual-level covariates

In this section we show results of the alternative specification for the dynamic hierarchical model described in the paper in which we expand the model at the second level by including a vector of individual-level covariates, $X_{j,t}^k$:

\begin{eqnarray}\label{eq:model2:2nd:control}
\alpha_{j,t} \sim N(\mu_{0,t} + \mu_{1,t} Female_j + \sum_{k = 1}^{K} \lambda_k X_{j,t}^k, \sigma_{\alpha})
\end{eqnarray}

\noindent where $X_{j,t}^k$ includes: 

- Party (categorical: Conservative; Labour; Liberal Democrat; Other)
- Government or opposition party status (binary)
- Government or opposition frontbench position (binary)
- Committee chair (binary)
- MP age (in years, continuous) 
- Margin of victory in prior election (percentage points, continuous)
- University degree (binary)
- Prior occupation (categorical: manual; professional; political; business; other)

We transform the two continuous predictors such that they have mean zero, and standard deviation one. We present the results for our main quantities of interest ($\mu_{1,t}$) estimated from this model in figure \ref{fig:model2_time_control}. 

\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/model2_gender_effect_time_control.pdf}
\caption{Gender differences in style over time controlling for individual-level confounders}
\label{fig:model2_time_control}
\end{center}
\vfill
}
\end{figure}

\elandscape
}

The figure shows that, in general, we recover very similar patterns of gender differences in style use over time when controlling for individual-level covariates. For human narrative, affect, positive emotion, negative emotion, fact, and aggression the trajectories of the gender differences over time are very similar to those presented in the main body of the paper. The largest differences are for complexity and repetition, where the pattern of convergence between men and women is somewhat attenuated in the estimates from the alternative specification. For complexity in particular, the large shift in the gender difference that we observe between 2008 and 2013 is confounded by some of the individual-level covariates, as the gender difference is largely constant (and indistinguishable from zero) for the entire time period once we control for these other factors. Nevertheless, overall, these results suggest that while other MP-level characteristics clearly account for some variation in style use, our central finding -- that the debating styles of male and female MPs have diverged from gender-based stereotypes over time -- is not affected by these estimates.

Figure \ref{fig:model2_covariate_effects} presents the estimates for each of the individual-level covariates for each style. Although these are not our primary quantities of interest, there are several patterns that are of substantive interest. First, we find, consistent with other work [@Proksch2019], that MPs from government parties use significantly less negative and more positive language than MPs from opposition parties. Government MPs are also less aggressive and tend to rely more on human narrative and less on fact-based arguments than their opposition counterparts. Second, compared with backbench MPs, politicians in leadership positions are less likely to use human narrative, more likely to make fact-based arguments, use substantially less emotive language, and are more repetitious in their speeches. We also see some evidence of partisan differences. Compared to Conservative Party MPs, Labour MPs use more human narrative, more factual language, and are somewhat less complex in their speeches. Liberal Democrat MPs, by contrast, make less use of human narrative, more use of fact, and are substantially less aggressive than Conservative MPs. There are also interesting patterns in speech styles according to the education and occupation variables. For instance, university-educated MPs tend to make less use of human narrative, and less use of negative emotional language, but deliver speeches that are more complex and more repetitious than their non-university educated counterparts. With regard to prior employment, MPs from manual occupations do appear to have distinct speechmaking styles, as they employ more human narrative, and less aggressive and repetitive language than MPs from other employment backgrounds. 
Overall, it is clear that there are many factors that influence the political styles that MPs adopt and, while these are not directly relevant to the substantive questions in our study, we think that these findings may be profitably investigated in future work.

\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/model2_covariate_effects.pdf}
\caption{Individual-level covariate effects}
\label{fig:model2_covariate_effects}
\end{center}
\vfill
}
\end{figure}

\elandscape
}


\newpage

# Style use and debate-type

Our model accounts for aggregate differences in style use across debates via the $\delta_d$ random-effects described in equation 2 in the main body. The inclusion of these parameters means that gender differences in style use cannot be attributed to men and women participating in systematically different types of debates, as the gender effects we estimate are based on within-debate variation in the style outcomes. However, it is possible that the magnitude of gender differences nevertheless varies across debates of different types. We investigate this possibility here. Specifically, we separate the debates in our data into common types that occur regularly in the UK House of Commons [for more detail, see @Blumenau2020a]:

1. **All**: all debates in our dataset. 
1. **Ministerial Question Time**: the routine questioning of Ministers, which occurs four times a week.
1. **Prime Minister's Question Time**: the Prime Minister answers questions from the Leader of the Opposition, opposition members and government backbenchers; this occurs once a week. 
1. **Procedural debates**: a compound category that includes debates that are not substantive in nature, but deal with matters of parliamentary procedure or scheduling. For example, Business of the House or Points of Order. 
1. **Legislation**: debates on legislation, including all stages of the process that occur in the Commons' chamber, such as second and third reading.
1. **Opposition Days and Backbench Business**: this includes business for debate that is placed on the parliamentary agenda by opposition members or backbenchers. 
1. **Other**: all other forms of debate that are not captured by the above categories. 

This categorisation captures important substantive differences between different types of debates in the House of Commons, some of which have been shown to be predictive of MPs' style in previous work [@Osnabrugge2020].

We run a series of OLS models for each of our outcomes, where our main explanatory variable of interest is the gender of the MP, and where we also control for party, age, years in parliament, margin of victory in the previous election, degree education, previous occupation, and whether the MP was a) a member of the cabinet, b) a member of the shadow cabinet, c) a government minister, d) a shadow minister, or e) a committee chair. For each outcome, we subset the data to only debates of a certain type, estimate the model, and record the coefficient on the gender variable at each iteration. Figure \ref{fig:debate_type_models} shows, for each style, the gender differences in the seven different debate types. 
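
The subset-and-estimate procedure described above can be sketched as follows. This is a minimal illustration on simulated data, not the code used to produce our estimates: the data frame `speeches`, its column names, the helper `gender_coef()`, and the reduced covariate set (party only) are all assumptions made for the example.

```r
# Toy data: all variable names and values are illustrative assumptions.
set.seed(1)
n <- 600
speeches <- data.frame(
  gender      = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  party       = factor(sample(c("Con", "Lab", "LD"), n, replace = TRUE)),
  debate_type = sample(c("questions", "PMQs", "legislation"), n, replace = TRUE)
)
speeches$aggression_std <- rnorm(n) - 0.2 * (speeches$gender == "Female")

# Fit the model on one subset and return the gender coefficient with a 95% CI.
gender_coef <- function(dat, outcome) {
  fit <- lm(reformulate(c("gender", "party"), response = outcome), data = dat)
  ci  <- confint(fit)["genderFemale", ]
  data.frame(est = unname(coef(fit)["genderFemale"]), lo = ci[[1]], hi = ci[[2]])
}

# "all" pools every debate; the remaining entries subset to one debate type.
subsets <- c(list(all = speeches), split(speeches, speeches$debate_type))
debate_coefs <- do.call(rbind, lapply(subsets, gender_coef, outcome = "aggression_std"))
debate_coefs
```

Each row of `debate_coefs` holds the estimated female-MP difference and its confidence interval for one subset, which is the shape of input that the plotting code below expects for each style.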

```{r, fig.width=8, fig.height=6, fig.cap= "Debate type models\\label{fig:debate_type_models}", fig.pos="h"}

load("working/debate_model_out.Rdata")
coef_list <- coef_list_debate_models 

all_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$all
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

questions_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$questions
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

pmqs_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$PMQs
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

procedure_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$procedure
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

legislation_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$legislation
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

opp_bbb_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$opp_bbb
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

other_out <- lapply(1:length(coef_list), function(x) {
  tmp <- coef_list[[x]]$other
  tmp$style <- names(coef_list)[x]
  tmp[rownames(tmp) == "genderFemale",]
})

all_out <- data.frame(do.call("rbind", all_out))
questions_out <- data.frame(do.call("rbind", questions_out))
pmqs_out <- data.frame(do.call("rbind", pmqs_out))
procedure_out <- data.frame(do.call("rbind", procedure_out))
legislation_out <- data.frame(do.call("rbind", legislation_out))
opp_bbb_out <- data.frame(do.call("rbind", opp_bbb_out))
other_out <- data.frame(do.call("rbind", other_out))

all_out$model <- "All"
questions_out$model <- "Questions"
pmqs_out$model <- "PMQs"
procedure_out$model <- "Procedure"
legislation_out$model <- "Legislation"
opp_bbb_out$model <- "Opposition/Backbench"
other_out$model <- "Other"

out <- rbind(all_out, questions_out, pmqs_out, procedure_out, legislation_out, opp_bbb_out, other_out)
# Factor levels match the seven debate types extracted above
out$model <- factor(out$model, levels = c("All", "Questions", "PMQs", "Procedure", "Legislation", "Opposition/Backbench", "Other"))
out$style[out$style == "affect_std"] <- "Affect"
out$style[out$style == "posemo_std"] <- "Positive Emotion"
out$style[out$style == "negemo_std"] <- "Negative Emotion"
out$style[out$style == "fact_std"] <- "Fact"
out$style[out$style == "anecdote_std"] <- "Anecdote"
out$style[out$style == "aggression_std"] <- "Aggression"
out$style[out$style == "complexity_std"] <- "Complexity"
out$style[out$style == "repetition_std"] <- "Repetition"

ggplot(out, aes(x = est, xmin = lo, xmax = hi, y = style, col = model)) + 
  geom_point(aes(shape=model)) + geom_errorbarh(height = .001) + 
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 15)) + 
  xlab("Female MP difference") + 
  ylab("") + geom_vline(xintercept = 0, linetype = 2) 

```

The analysis reveals that the magnitude of average gender differences is relatively constant across the debate types. Of the debate types we identify, Prime Minister's Questions seems to be the only one that significantly affects the gender coefficients. Relative to the model which pools across all debates, in PMQs the magnitude of gender differences is larger for repetition, aggression, and affect; smaller for negative emotion; and, for fact, statistically indistinguishable from zero. Overall, however, while there is some variation in the magnitude of gender differences across debate types, these differences are for the most part very small. 

In figure \ref{fig:model2_debate_average} we show additional descriptive information on the average level of each style in speeches across the different debate types. The patterns in style use across debates generally conform with standard intuitions. For instance, the figure shows that both Question Time and Prime Minister's Questions (PMQs) debates are substantially less positive than debates on legislation, which is consistent with the idea that these settings are used by the opposition parties to interrogate -- and often castigate -- the government on issues of the day. Similarly, both PMQs and debates initiated by the Opposition parties in parliament are more aggressive than other debates, which again follows the intuition that these debates are mainly used as a vehicle for criticising government policy. In general, these descriptive figures bolster the results from our validation exercises above, as they imply that our measures accurately capture expected differences in speech style across different types of parliamentary debate.

\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/model2_debate_effects.pdf}
\caption{Style type average by debate type}
\label{fig:model2_debate_average}
\end{center}
\vfill
}
\end{figure}

\elandscape
}

\newpage

# Within-MP and replacement effects

Does gender explain less variation in aggregate style use over time because of a gradual convergence in styles of female and male MPs throughout their careers in parliament? Or do gender gaps decrease because the men and women entering parliament over time are systematically different from those leaving parliament? Which of these two explanations -- which we refer to as "within-MP" and "replacement" effects -- is responsible for the aggregate patterns we document in the main body of the paper? Our modelling approach allows us to decompose the evolving gender differences that we report in the section above into these two mechanisms of change.

Given the model described by equations 2 and 3 in the main body of the paper, we can decompose the shifting patterns of gendered style use into those changes that stem from within-MP change over time, and those that come from replacement. Our goal is to specify a decomposition of $\mu_{0,t}^m - \mu_{0,t-1}^m$, which is the change in average style use for men between parliamentary session $t-1$ and session $t$ (we can then provide an equivalent approach for female MPs). We begin by distinguishing between three types of MP, which we label as "remainers", "joiners", and "leavers":
\begin{itemize}
\item $J_m^R$ is the set of male MPs who appear in both session $t$ and $t-1$ (Remainers)
\item $J_m^J$ is the set of men who appear in $t$ but not in $t-1$ (Joiners)
\item $J_m^L$ is the set who appear in $t-1$ and not in $t$ (Leavers)
\end{itemize}

We also will require the fraction of men who are "remainers" in $t$ and $t-1$:
\begin{itemize}
\item $\pi^R_t$ is the fraction of male MPs in $t$ who also served in $t-1$
\item $\pi^R_{t-1}$ is the fraction of male MPs in $t-1$ who also served in $t$
\end{itemize}
Note that the proportion of male MPs who are "remainers" in $t$ may be different from the proportion in $t-1$, because some male MPs who leave parliament in $t-1$ will be replaced by women in $t$ (and vice versa).

Given these definitions, we can write the mean style use for men in each period as a function of the MP-period effects ($\alpha_{j,t}$):

\begin{eqnarray}
\mu_{0,t-1}^m &=& \underbrace{\pi^R_{t-1}\frac{1}{|J_m^R|}\sum_{j\in J_m^R} \alpha_{j,t-1}}_\text{Remaining MPs}  + \underbrace{(1 - \pi^R_{t-1})\frac{1}{|J_m^L|}\sum_{j\in J_m^L} \alpha_{j,t-1}}_\text{Leaving MPs} \label{style_mean_t0} \\
\mu_{0,t}^m &=& \underbrace{\pi^R_{t}\frac{1}{|J_m^R|}\sum_{j\in J_m^R} \alpha_{j,t}}_\text{Remaining MPs} + \underbrace{(1 - \pi^R_{t})\frac{1}{|J_m^J|}\sum_{j\in J_m^J} \alpha_{j,t}}_\text{Joining MPs} \label{style_mean_t1}
\end{eqnarray}

Here, $\mu_{0,t-1}^m$ is a weighted average of the finite-sample averages of the "remainers" and "leavers" in $t-1$, where the weights are given by the relative proportion of those groups in that parliamentary session. $\mu_{0,t}^m$ is constituted from the equivalent averages for "remainers" and "joiners" in time period $t$, again weighted by the size of those two groups in $t$.

Subtracting equation \ref{style_mean_t0} from equation \ref{style_mean_t1} and rearranging reveals an additive decomposition which separates the two effects of interest:

\begin{eqnarray}
\mu_{0,t}^m - \mu_{0,t-1}^m &=& \underbrace{\pi^R_{t}\frac{1}{|J_m^R|}\sum_{j\in J_m^R} \alpha_{j,t} - \pi^R_{t-1}\frac{1}{|J_m^R|}\sum_{j\in J_m^R} \alpha_{j,t-1}}_\text{``Within-MP'' effect ($W_m$)} + \nonumber \\ && \underbrace{(1-\pi^R_{t}) \frac{1}{|J_m^J|}\sum_{j\in J_m^J} \alpha_{j,t}  - (1- \pi^R_{t-1})\frac{1}{|J_m^L|}\sum_{j\in J_m^L} \alpha_{j,t-1}}_\text{``Replacement'' effect ($R_m$)} \label{style_difference}
\end{eqnarray}

We denote the within-MP effect for men as $W_m$ and the replacement effect as $R_m$. We can also, of course, define the same quantities for female MPs ($W_w$ and $R_w$), and can therefore describe the changing gender difference in terms of within-MP and replacement effects:

\begin{equation}\label{eq:net_within_replacement_difference}
(\mu_{0,t}^w - \mu_{0,t}^m) - (\mu_{0,t-1}^w - \mu_{0,t-1}^m) = \underbrace{(W_w - W_m)}_\text{``Within-MP'' difference} +  \underbrace{(R_w - R_m)}_\text{``Replacement'' difference}
\end{equation}
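
As a rough numerical illustration of the decomposition for a single group of MPs, the sketch below computes the within-MP and replacement components from a toy table of MP-period effects and confirms that they sum to the total change in the group mean. The values, the data frame `alpha`, and the function `decompose_change()` are hypothetical and are not taken from our model output.

```r
# Toy MP-period effects (alpha_{j,t}) for two sessions; values are made up.
alpha <- data.frame(
  mp      = c("a", "b", "c", "d", "a", "b", "e"),
  session = c(1, 1, 1, 1, 2, 2, 2),
  alpha   = c(0.2, 0.4, -0.1, 0.3, 0.1, 0.5, 0.6)
)

decompose_change <- function(dat, t_prev, t_now) {
  prev <- dat[dat$session == t_prev, ]
  now  <- dat[dat$session == t_now, ]
  remainers <- intersect(prev$mp, now$mp)
  pi_prev <- length(remainers) / nrow(prev)  # share of remainers in t-1
  pi_now  <- length(remainers) / nrow(now)   # share of remainers in t

  # An empty group has weight zero, so its mean is set to 0 rather than NaN.
  group_mean <- function(x) if (length(x) == 0) 0 else mean(x)

  within <- pi_now * group_mean(now$alpha[now$mp %in% remainers]) -
    pi_prev * group_mean(prev$alpha[prev$mp %in% remainers])
  replacement <- (1 - pi_now) * group_mean(now$alpha[!now$mp %in% remainers]) -
    (1 - pi_prev) * group_mean(prev$alpha[!prev$mp %in% remainers])

  c(within = within, replacement = replacement,
    total = mean(now$alpha) - mean(prev$alpha))
}

res <- decompose_change(alpha, t_prev = 1, t_now = 2)
res
```

In this toy example MPs `a` and `b` are remainers, `c` and `d` are leavers, and `e` is a joiner; the two components add up to the change in the overall mean between the two sessions.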

Turning to our results, we plot these quantities in the left (for male MPs) and centre (for female MPs) panels of figure \ref{fig:model2_socialisation_replacement}. The x-axis describes the average direction and magnitude of changes between parliamentary sessions for each style for men and women, respectively. The right-hand panel reports *the difference* in the effects for women and men. In each panel, hollow points show changes that occur because of replacement, and solid points show changes that occur due to within-MP shifts. 
 
\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/model2_replacement_socialisation_decomposition_average_uncertainty.pdf}
\caption{Within-MP and replacement over time change by gender}
\label{fig:model2_socialisation_replacement}
\end{center}
\vfill
}
\end{figure}

\elandscape
}

We use these plots to understand whether replacement or within-MP change is a stronger determinant of the aggregate shifts we observe. Overall, neither differential replacement nor within-MP change alone explains the convergence that we document across multiple different styles in the main body of the paper, though there is some evidence that replacement is more important as a mechanism for explaining the changing gender dynamics we observe for the "agentic" styles, while within-MP change is somewhat more important for explaining change for more "communal" styles.

For example, figure 2 in the main body of the paper shows that women are much more likely than men to use negative emotion in their speeches in later years, but only somewhat more likely in the earlier years. The middle panel of figure \ref{fig:model2_socialisation_replacement} shows that the replacement effect for women for negative emotion is positive (the hollow point for negative emotion is greater than zero), which implies that newly elected women are more negative than the women leaving parliament, on average. However, the left panel of figure \ref{fig:model2_socialisation_replacement} suggests that this is *not* true for male MPs: male MPs joining parliament use negative language at the same rate on average as male MPs leaving parliament (the hollow point for negative emotion is close to zero). Consequently, the right panel suggests a (positive) differential replacement effect for negative emotion. Note that the difference between male and female *within-MP* effects is close to zero for negative emotion. This suggests that the divergence between men and women that we note at the aggregate level is almost entirely driven by differential replacement between male and female MPs, rather than by changes in the behaviour of existing MPs over time. 

The right-hand panel of figure \ref{fig:model2_socialisation_replacement} indicates that, beyond negative emotion, replacement effects also account for a greater share of the aggregate change in gender differences for factual language and complexity. For both, while the women entering parliament are significantly more likely to use these styles than the women leaving, newly elected male MPs employ these styles at broadly similar rates to the men that they replace. By contrast, both male *and* female MPs are less likely to use factual language as their careers in parliament progress, and the speeches of both men *and* women become more complex the longer they spend in parliament. Consequently, the large aggregate shifts that we observe for these styles are largely driven by the fact that the women newly elected to parliament adopted a legislative debating style that was more factual, complex, and negative than the women they replaced.

For other styles, we see that within-MP change accounts for a greater share of the variation in gender differences. For instance, the gradually decreasing gap in the use of human narrative in figure 2 in the main body of the paper is mostly attributable to women MPs using this style less the longer that they stay in parliament, whereas the decrease in the use of human narrative among male MPs is much smaller. Similarly, on average women employ less positive emotion over time, whereas the positive language use of male MPs remains relatively constant. Conversely, within-MP change in affect for men is positive, implying that men become more emotional overall in their speeches over time, but there is very little average within-MP change in affect for women. These results imply that, for these style types, the convergence that we see in the main analysis is driven by the different stylistic trajectories that male and female MPs appear to follow throughout their tenure in parliament.

\newpage

# Topic-based confounding

We present evidence of convergence between men and women with respect to several debating styles over time. One potential concern for the interpretation of our results is that the parliamentary agenda is not fixed, and changes to the set of issues under discussion may result in convergence between men and women even in the absence of behaviour change. 

Consider, for instance, a style like human narrative, where we observe a large convergence between men and women over time. Women are significantly more likely to use human narrative in their parliamentary speeches at the beginning of the time period than they are at the end. If, however, women are more likely to use human narrative than men in certain *topics*, and those topics become less prevalent over time, then the convergence we document might in fact be attributable to changes to the parliamentary agenda. For changes in topic prevalence to be responsible for convergence, it would have to be the case that the topics on which we observe women using *more* human narrative than men are becoming *less* prevalent, or that the topics on which we see women using *less* human narrative than men are becoming *more* prevalent over time. For example, perhaps women use more human narrative than men when discussing education policy, and education policy is more frequently discussed in the early period in our data than in the later period. If this were true, then our results might be subject to topical confounding, as changes in topical prevalence over time would account for the aggregate changes we observe in the main analysis.

To address this concern, in this section we use statistical topic models to evaluate whether topics on which we observe notable stylistic differences between men and women become more or less prevalent over time. We begin by estimating a correlated topic model [@blei2006correlated] (CTM) for all speeches in our data. The CTM is an unsupervised learning approach which assumes that the frequency with which words co-occur within different speeches provides information about the topics that feature in those speeches. As with other topic models, the CTM requires the analyst to choose the number of topics, $K$. Given that our results might be sensitive to this choice, we present results from a series of models in which we vary the number of topics: $K \in \{10, 20, \ldots, 80\}$. We estimate the CTM as the null form of the Structural Topic Model (that is, a model without covariates), which we implement in R [@roberts2014stm].

The key output of the topic model is $\theta$, an $N \times K$ matrix of topic proportions that measures the degree to which each speech ($i$) in the data features each of the estimated topics ($k$). $\theta_{i,k}$ therefore gives the proportion of speech $i$ devoted to topic $k$. With these topics in hand, we then evaluate -- for each of our 8 styles -- the size of the stylistic gender gap between men and women on each topic. To do so, we estimate models where we interact the gender of the MP delivering a speech with the topic proportions that pertain to that speech:

\begin{equation}\label{eq:topic_first_stage}
y_{i(j)}^s = \alpha + \beta^1 Female_j + \sum_{k = 2}^K \beta_k^2 \theta_{i,k} + \sum_{k = 2}^K \beta_k^3 (Female_j \cdot \theta_{i,k}) + \epsilon_{i(j)}
\end{equation}

We use the coefficients of this model to calculate estimated average differences between men and women on speeches devoted to each topic, which we denote as:

\begin{equation}
\label{eq:cases}
\delta_k^s = 
\begin{cases} 
\beta^1 & \mbox{if } k = 1 \\ 
\beta^1 + \beta_k^3 & \mbox{if } k \neq 1 
\end{cases} 
\end{equation}

The average difference in style $s$ between men and women on speeches that are entirely devoted to topic 1 is given by $\beta^1$ (i.e. the baseline), and $\beta^1 + \beta_k^3$ captures the average gender difference in style on speeches entirely devoted to topic $k$. We denote the gender difference on each topic and style as $\delta_k^s$. This specification allows us to capture the aggregate differences between male and female use of a style on each topic. Positive values for $\delta_k^s$ indicate that women use the style more than men in a given topic, and negative values suggest that women use the style less than men in a given topic. 
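
To make the construction of $\delta_k^s$ concrete, the following sketch fits the interaction model on simulated data with three topics and recovers the topic-specific gender gaps from the coefficients. The data, the column names (`topic1` to `topic3`, `female`), and the effect sizes are assumptions for illustration only.

```r
set.seed(2)
n <- 500; K <- 3
# Toy topic proportions (rows sum to one); column names are assumptions.
raw   <- matrix(rexp(n * K), n, K)
theta <- raw / rowSums(raw)
colnames(theta) <- paste0("topic", 1:K)
female <- rbinom(n, 1, 0.3)
y <- rnorm(n) + 0.3 * female + 0.5 * female * theta[, "topic2"]
dat <- data.frame(y, female, theta)

# Topic 1 is the omitted baseline, matching the sums over k = 2, ..., K.
fit <- lm(y ~ female + topic2 + topic3 + female:topic2 + female:topic3, data = dat)
b <- coef(fit)

# Gender gap on a speech entirely devoted to each topic.
delta <- c(topic1 = b[["female"]],
           topic2 = b[["female"]] + b[["female:topic2"]],
           topic3 = b[["female"]] + b[["female:topic3"]])
delta
```

Because the topic proportions sum to one, one topic must be omitted from the regression; the gap on that baseline topic is then read directly from the coefficient on the gender indicator.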

We then estimate a second set of regression models to capture, for each topic, the relationship between time and topic prevalence. To do so, we first multiply the number of words in each speech by the vector of topic proportions for that speech, giving us the weighted number of words dedicated to a given topic for each speech in the data. We then sum these topic-weighted word counts across all speeches within a given calendar month, and use the summed word counts as the dependent variable for regressions of the form:

\begin{equation}\label{eq:topic_second_stage}
y_{t}^k = \alpha + \gamma_k YearMon_t + \epsilon_{t}
\end{equation}

Here, $y_{t}^k$ is the number of words on topic $k$ in time period $t$, and $\gamma_k$ captures the linear relationship between time and topic prevalence for topic $k$. Positive values of $\gamma_k$ imply that topic $k$ becomes more prevalent in parliamentary debate throughout the study period, and negative values suggest that the topic becomes less prevalent over time.  
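
The aggregation and trend estimation described above can be sketched in a few lines. The example uses simulated topic proportions, speech lengths, and month indices; the object names (`theta`, `n_words`, `month`, `monthly`) are illustrative assumptions rather than the names used in our replication code.

```r
set.seed(3)
n <- 400; K <- 3
raw   <- matrix(rexp(n * K), n, K)
theta <- raw / rowSums(raw)                   # toy topic proportions
colnames(theta) <- paste0("topic", 1:K)
n_words <- rpois(n, 200)                      # toy speech lengths
month   <- sample(1:24, n, replace = TRUE)    # months since start of the period

# Topic-weighted word counts per speech, then summed within each month.
topic_words <- theta * n_words                # row i is scaled by the length of speech i
monthly <- rowsum(topic_words, group = month) # rows: months, columns: topics

# One linear time trend (gamma_k) per topic.
time_index <- as.numeric(rownames(monthly))
gamma <- apply(monthly, 2, function(y_k) coef(lm(y_k ~ time_index))[["time_index"]])
gamma
```

Since each row of `theta` sums to one, the topic-weighted counts add up to the total number of words spoken, so the monthly series partition the overall volume of debate across topics.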

If the topical confounding argument is correct, then for a style like human narrative -- where we observe average convergence between men and women over time -- there must be a negative relationship between the gender gap on a topic and the change in that topic's prevalence over time. That is, topics where women use human narrative more than men (a positive gender gap implied by equation \ref{eq:topic_first_stage}) should be becoming less prevalent over time (negative coefficient from equation \ref{eq:topic_second_stage}). 

The topical-confounding hypothesis implies different relationships between topical gender-gaps and changes in topic prevalence over time for different styles. For instance, for human narrative, our main analysis shows that women are more likely to use this style in the early period of our data and less in the later period. For this style, topical confounding would occur if topics where women use narrative *more* on average than men (positive $\delta_k^s$ from equation \ref{eq:cases}) became *less* prevalent over time (negative $\gamma_k$ from equation \ref{eq:topic_second_stage}), or the topics where women use narrative *less* than men (negative $\delta_k^s$ from equation \ref{eq:cases}) became *more* prevalent over time (positive $\gamma_k$ from equation \ref{eq:topic_second_stage}). For human narrative, then, the topical-confounding hypothesis implies a negative relationship between the two sets of coefficients.

On the other hand, our aggregate results suggest that women are less aggressive than men in the early period of the data but are as aggressive as men later in the period. Accordingly, if this convergence can be explained by changes to the topics under discussion, it must be the case that the topics on which women tend to be less aggressive than men (negative $\delta_k^s$ from equation \ref{eq:cases}) become less prevalent over time (negative $\gamma_k$ from equation \ref{eq:topic_second_stage}), or that the topics on which women tend to be more aggressive than men (positive $\delta_k^s$ from equation \ref{eq:cases}) become more prevalent over time (positive $\gamma_k$ from equation \ref{eq:topic_second_stage}). Therefore, for aggression, the topical confounding hypothesis implies a positive relationship between the two sets of coefficients.

Following this logic through all eight style types, the topical-confounding explanation suggests that we should observe a positive relationship between $\gamma_k$ and $\delta_k^s$ for aggression, complexity, fact and negative emotion, and a negative relationship between $\gamma_k$ and $\delta_k^s$ for human narrative, affect, positive emotion, and repetition. 

In figure \ref{fig:gap_vs_time} we evaluate these expectations by plotting the estimated values of $\gamma_k$ and $\delta_k^s$ against each other for each style. In this plot, each point represents a single topic from our $K = 40$ topic model: the x-axis measures the gender gap in the use of a given style ($\delta_k^s$), and the y-axis measures the changing prevalence of the topic over time ($\gamma_k$). We also fit a regression line between the sets of coefficients, which is coloured in red if the slope of the line is associated with a p-value of less than 0.05, and otherwise is coloured in grey.
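
The comparison underlying each panel can be sketched as a regression of the topic-level time trends on the topic-level gender gaps, with the sign of the slope checked against the expectation for that style. The values of `delta` and `gamma` below are random stand-ins, and the object names are hypothetical; only the expected signs are taken from the discussion above.

```r
set.seed(4)
K <- 40
# Stand-ins for the estimated topic-level quantities (illustrative values only).
delta <- rnorm(K, mean = 0, sd = 0.1)  # gender gap on topic k for one style
gamma <- rnorm(K, mean = 0, sd = 50)   # time trend in the prevalence of topic k

fit   <- lm(gamma ~ delta)
slope <- coef(summary(fit))["delta", ]

# Sign that topical confounding would require for each style.
expected_sign <- c(aggression = 1, complexity = 1, fact = 1, negative_emotion = 1,
                   human_narrative = -1, affect = -1, positive_emotion = -1,
                   repetition = -1)

style <- "human_narrative"
consistent_with_confounding <-
  sign(slope[["Estimate"]]) == expected_sign[[style]] && slope[["Pr(>|t|)"]] < 0.05

# Colouring rule for the fitted line, as described for the figure.
line_colour <- if (slope[["Pr(>|t|)"]] < 0.05) "red" else "grey"
```

A style would only raise concern if the slope is both statistically distinguishable from zero and in the direction that confounding requires.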

\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/topic/gap_vs_time_40.pdf}
\caption{\textbf{Topical-confounding}: The figure shows the relationship between the gender gap in the use of a given style on a given topic (x-axis), and the change in the prevalence of a given topic over time (y-axis). }
\label{fig:gap_vs_time}
\end{center}
\vfill
}
\end{figure}

\elandscape
}


The main implication of this analysis is straightforward: we find very little evidence to support the topical-confounding hypothesis. The size of the gender gap measured for a given style on a given topic largely does not predict the degree to which that topic becomes more or less prevalent over time. For three of the styles -- aggression, negative emotion, and fact -- the relationships in figure \ref{fig:gap_vs_time} are negative, where they would need to be positive for topical changes to explain the stylistic convergence we document in the main body of the paper. We also find a relationship that is in the "wrong" direction for repetition (that is, although statistically significant, the relationship would need to be negative to cause concern), and there is also essentially no relationship between the gender gap in human narrative on different topics and the changing prevalence of those topics over time. For the remaining styles -- affect, positive emotion, and complexity -- we do find some evidence that topics on which women display more of these styles become more prevalent over time, but the relationships are very noisy and in none of those cases are we able to reject the null hypothesis of a relationship of zero.

As there is no *a priori* reason to base our inferences on the $K = 40$ topic model, in figure \ref{fig:gap_vs_time_all_k} we summarise the relevant results from all 8 topic model specifications. In this plot, the x-axis measures the value for $K$, and the y-axis measures the slope of the regression line for the changing prevalence of a topic over time ($\gamma_k$) as a function of the gender gap in the use of a given style in that topic ($\delta_k^s$). The results clearly demonstrate that our findings are not sensitive to the number of topics used in the analysis. For all models, we find patterns that are very similar to those depicted in figure \ref{fig:gap_vs_time}. The only exception is that we find a significant coefficient for the "fact" style in the $K = 80$ topic model. However, again, this relationship is in the "wrong" direction as it suggests a negative relationship between the topic-specific gender-gap in factual language and over-time topic prevalence, where the topical confounding story implies a positive relationship between these quantities for the factual language style. 

Taken together, these analyses imply that the aggregate patterns we observe in the main body of the paper cannot be convincingly explained by changes to the parliamentary agenda over time. 


\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/topic/gap_vs_time_all_topics.pdf}
\caption{\textbf{Topical-confounding, varying $K$}: On the y-axis, the figure summarises the linear relationship between the gender gap in the use of a given style on a given topic ($\delta_k^s$), and the change in the prevalence of a given topic over time ($\gamma_k$). The x-axis measures the number of CTM topics, $K$, used to estimate these relationships.}
\label{fig:gap_vs_time_all_k}
\end{center}
\vfill
}
\end{figure}

\elandscape
}


\newpage

# Style use and debate participation

Our results show that, on average, female MPs deliver speeches that are less likely to be marked by "communal" styles and more likely to be marked by "agentic" styles over time. One potential alternative explanation for our results is that male and female MPs who employ different speaking styles might have become differentially likely to *participate* in parliamentary debate over time. We might imagine, for instance, that female MPs who tend to deliver highly agentic speeches gave more speeches in parliament over the course of the study period, and that women who tend to deliver highly communal speeches participated less in debate over time. If that were the case, differential participation might drive the changing gender speechmaking dynamics that we document in the paper, rather than within-MP changes. 

To investigate this alternative explanation, we assess whether the average style of an MP across all speeches in a given parliamentary term predicts the number of speeches that the MP delivers. We begin by measuring the number of speeches delivered by each MP in each parliamentary term ($\text{\# Speeches}_{i(t)}$), which we then model as a function of the gender of the MP, the average style of speeches given by the MP in that term ($Style_{i(t)}^s$), and the interaction between these two variables. Specifically, for each parliamentary term, $t$, and each style, $s$, we estimate a model of the following form:
\begin{equation}\label{eq:style_use}
\text{\# Speeches}_{i(t)} = \alpha + \beta_1 Female_{i} + \beta_2 Style_{i(t)}^s + \beta_3 (Female_{i} \cdot Style_{i(t)}^s) + \epsilon_{i(t)}
\end{equation}

Our key quantities of interest here are $\beta_2$, which measures the effect of a standard deviation increase in the use of a given style on the number of speeches delivered by men, and $\beta_2 + \beta_3$, which gives the same quantity for female MPs. If our results are driven by a selection-based story about the types of MPs who choose to participate in debate, then we should find that these two quantities broadly mirror the aggregate patterns we document in figure 2 of the paper. For example, if differential participation is the explanation for the decreasing average use of "human narrative" by female MPs, then we should observe a weaker relationship between the degree to which a female MP's speeches tend to feature human narrative and the number of speeches delivered by that MP over time. Similarly, for "negative emotion", if selection into debate drives the increasing use of that style by women, we would expect to see the relationship between the use of negative emotion and the number of speeches delivered by female MPs to have strengthened over time. We present our quantities of interest for each style in each parliamentary term in figure \ref{fig:style_use_speech_rate}.
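
For a single style and a single parliamentary term, the two quantities of interest can be computed as in the sketch below. The data are simulated so that speech counts are unrelated to style by construction, and the names `mps`, `style`, and `n_speeches` are assumptions for the example; in practice the model would be re-estimated for every term and style.

```r
set.seed(5)
n <- 300
mps <- data.frame(female = rbinom(n, 1, 0.3), style = rnorm(n))
mps$n_speeches <- rpois(n, 40)   # by construction unrelated to style

fit <- lm(n_speeches ~ female * style, data = mps)
b <- coef(fit)
V <- vcov(fit)

effect_men   <- b[["style"]]                        # beta_2
effect_women <- b[["style"]] + b[["female:style"]]  # beta_2 + beta_3

# Standard error of the sum of two coefficients.
se_women <- sqrt(V["style", "style"] + V["female:style", "female:style"] +
                 2 * V["style", "female:style"])
c(men = effect_men, women = effect_women, se_women = se_women)
```

Because gender is fully interacted with style, the quantity for women equals the slope from a regression estimated on female MPs only, which offers a simple check on the calculation.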

\afterpage{
\blandscape

\begin{figure}
\parbox[c][\textwidth][s]{\linewidth}{%
\vfill
\begin{center}
\includegraphics{analysis/plots/style_use_speech_rate.pdf}
\caption{\textbf{Participation as a function of average style use, by parliamentary term}: The figure illustrates the average marginal effect of a one standard deviation increase in the average style use on the number of times an MP speaks in a given parliamentary term.}
\label{fig:style_use_speech_rate}
\end{center}
\vfill
}
\end{figure}

\elandscape
}


In general, we find very little evidence that the average style of an MP predicts participation in debate at any point during the study period. Across almost all styles, the effects are indistinguishable from zero, implying that it is very unlikely that our results are driven by which MPs choose to speak in debate. Moreover, there are no clear over-time trends in these coefficients, which undermines the idea that, for example, women with more agentic speaking styles participate more over time. In other words, this analysis suggests that the sample of speeches that we observe does *not* appear to be disproportionately delivered by the more "communal" female MPs in the early period, and by more "agentic" female MPs in the later period. Rather, this analysis suggests that the changes over time that we document in the paper are largely driven by within-MP changes in speaking style, and the replacement of MPs with different style-types over time (see figure \ref{fig:model2_socialisation_replacement} above). 
